arXiv:1506.02327v1 [cs.CL] 7 Jun 2015 


A Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN) for 
Unsupervised Discovery of Linguistic Units and Generation of High Quality Features 

Cheng-Tao Chung, Cheng-Yu Tsai, Hsiang-Hung Lu, 

Yuan-ming Liou, Yen-Chen Wu, Yen-Ju Lu, Hung-yi Lee and Lin-shan Lee 

Graduate Institute of Electrical Engineering, National Taiwan University 

Graduate Institute of Communication Engineering, National Taiwan University 

f01921031@ntu.edu.tw, r02942067@ntu.edu.tw, r03942039@ntu.edu.tw, qxesqxes@gmail.com, 
r03942044@ntu.edu.tw, r03942063@ntu.edu.tw, tlkagk.b93901106@gmail.com, lslee@gate.sinica.edu.tw 


Abstract 

This paper summarizes the work done by the authors for the Zero 
Resource Speech Challenge organized in the technical program of 
Interspeech 2015. The goal of the challenge is to discover linguistic 
units directly from unlabeled speech data. The Multi-layered Acoustic 
Tokenizer (MAT) proposed in this work automatically discovers 
multiple sets of acoustic tokens from the given corpus. Each acoustic 
token set is specified by a set of hyperparameters that describe 
the model configuration. These sets of acoustic tokens carry different 
characteristics of the given corpus and the language behind it, and 
thus can be mutually reinforced. The multiple sets of token labels are 
then used as the targets of a Multi-target DNN (MDNN) trained on 
low-level acoustic features. Bottleneck features extracted from the 
MDNN are used as feedback for the MAT and the MDNN itself. We 
call this iterative system the Multi-layered Acoustic Tokenizing Deep 
Neural Network (MAT-DNN), which generates high quality features 
for Track 1 of the challenge and acoustic tokens for Track 2 of the 
challenge. 

Index Terms: zero resource, unsupervised learning, dnn, hmm 

1. Introduction 

Human infants acquire knowledge of a language by mere immersion 
in a language-speaking community. The process is not yet completely 
understood, and is difficult to reproduce with current automatic 
speech recognition (ASR) technologies, where the dominant paradigm 
is supervised learning with large human-annotated data sets [1]. The 
idea behind the Zero Resource Speech Challenge is to inspire the 
development of speech recognition under the extreme situation where a 
whole language has to be learned from scratch [2][3]. The goal of this 
challenge is to find linguistic units directly from raw audio with no 
knowledge of the language, the speaker, or any other supplementary 
information. The challenge includes two tracks, which focus on 
subword units and word units respectively. In the first track, 
unsupervised subword modeling, the aim is to construct a framewise 
feature representation of speech sounds that is robust to within-speaker 
and across-speaker variation. Dynamic Time Warping (DTW) is 
performed on sequences of these features for predefined phone pair 
intervals to extract the warping distance. The performance of the 
features is evaluated using the ABX discriminability [4] on 
within-speaker and across-speaker phone pairs. The second track focuses 
on the discovery of word units, and the aim is to extract timing 
information for such word units in the hypothesized vocabularies derived 
from the speech corpus. The intervals in which each word unit appears in 
the corpus are then evaluated on parsing, clustering and matching 
quality [5]. This paper documents the work submitted to the challenge, 
within the Interspeech 2015 technical program, by a team organized at 
National Taiwan University. 

In this work, we propose a completely unsupervised framework, the 
Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN), 
for the task. A Multi-layered Acoustic Tokenizer (MAT) is 
used to generate multiple sets of acoustic tokens. Each acoustic token 
set is specified by a pair of hyperparameters representing the model 
granularities of the tokens. As a naming convention, we call an acoustic 
token set obtained from a hyperparameter pair a layer. Each layer 
carries complementary knowledge about the corpus and the language 
behind it [6]. Since it is well known that speech signals have multi-level 
structures, including at least phonemes and words, which are helpful in 
analysing or decoding speech [7], these sets of acoustic tokens can be 
further mutually reinforced [8]. The multi-layered token labels generated 
by the MAT are then used as the training targets of a Multi-target 
Deep Neural Network [9] (MDNN) to learn framewise bottleneck 
features [10] (BNFs). The BNFs are then used as feedback to both 
the MAT and the MDNN in the next iteration. The BNFs from the 
MDNN are evaluated in Track 1, while the time intervals for the acoustic 
tokens obtained in the MAT are evaluated in Track 2. 


2. Proposed Approach 

2.1. Overview of the proposed framework 

The framework of the approach is shown in Fig. 1. In the left part, 
the Multi-layered Acoustic Tokenizer (MAT) produces many sets of 
acoustic tokens using unsupervised HMMs, each describing different 
aspects of the given corpus. These tokens are specified by two 
hyperparameters describing the HMM configuration. A set of acoustic tokens 
is obtained for each configuration by iteratively optimizing the token 
models and the token labels on the given acoustic corpus. Multiple 
pairs of hyperparameters are selected, producing multi-layered token 
labels for the given corpus to be used as the training targets of 
the Multi-target Deep Neural Network (MDNN) in the right part of 
Fig. 1. The MDNN learns its parameters with the multi-layered token 
labels from the MAT as its targets, so the knowledge carried by the 
different token sets on different layers is fused. Bottleneck features 
are then extracted from this MDNN. In the first iteration, some initial 
acoustic features are used for both the MAT and the MDNN. This gives 
the first set of bottleneck features. These bottleneck features are then 
used as feedback to both the MAT (to replace the initial acoustic features) 
and the MDNN (to be concatenated with the initial acoustic features 
to produce tandem features) in the second iteration. Such feedback 
can be continued iteratively. The complete framework is referred to 
as the Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN) 
in this paper. The output of the MDNN (bottleneck features) is 
evaluated in Track 1 of the Challenge, while the time intervals for the 
acoustic token labels at the output of the MAT are evaluated in Track 
2 of the Challenge. 

2.2. Multi-layered Acoustic Tokenizer 

The goal in this step is to obtain multiple sets of acoustic tokens, each 
defined by some hyperparameters, which capture complementary aspects 
of the corpus. No knowledge regarding the corpus is assumed at 
all, so the process here is completely unsupervised. 

2.2.1. Unsupervised Token Discovery for Each Layer of MAT 

Using unsupervised HMMs, it is straightforward to discover acoustic 
tokens from the corpus for a chosen hyperparameter pair $\psi$ that 
determines the HMM configuration (number of states per model and 
number of distinct models) [11][12][13][14][15]. This can be achieved 
by first finding an initial label set $\omega_0$ based on a set of assumed 
tokens for all features in the corpus $X$ as in (1). Then in each iteration 
$t$ the HMM parameters $\theta_t^{\psi}$ can be trained with the label set 
$\omega_{t-1}$ obtained in the previous iteration as in (2), and the new 
label set $\omega_t$ can be obtained by token decoding with the obtained 
parameters $\theta_t^{\psi}$ as in (3). 
Figure 1: The proposed framework of Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN) 


The training process can be repeated for a sufficient number of 
iterations until a converged set of token HMMs is obtained. The processes 
(2) and (3) are referred to as token model optimization and token label 
optimization respectively in the left part of Fig. 1. 
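
For reference, the three steps (1)-(3) described above can be written compactly as follows. This is a sketch reconstructed from the prose of this subsection only; the operator name "initialization" and the exact form of the likelihood being maximized are assumptions, not necessarily the notation of the original equations.

```latex
\begin{align}
\omega_0 &= \mathrm{initialization}(X, \psi), \tag{1}\\
\theta_t^{\psi} &= \arg\max_{\theta^{\psi}} P\left(X \mid \theta^{\psi}, \omega_{t-1}\right), \tag{2}\\
\omega_t &= \arg\max_{\omega} P\left(X \mid \theta_t^{\psi}, \omega\right). \tag{3}
\end{align}
```

Steps (2) and (3) alternate until the label set stops changing, in the manner of a hard-decision (Viterbi-style) training loop.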

2.2.2. Granularity Space of Multi-layered Acoustic Token Sets 

The process explained above can be performed with different HMM 
configurations, each characterized by two hyperparameters: the number 
of states $m$ in each acoustic token HMM, and the total number 
of distinct acoustic tokens $n$ during initialization, $\psi = (m, n)$. The 
transcription of a signal decoded with these tokens can be considered 
as a temporal segmentation of the signal, so the HMM length 
(or number of states in each HMM) $m$ represents the temporal 
granularity. The set of all distinct acoustic tokens can be considered as 
a segmentation of the phonetic space, so the total number $n$ of distinct 
acoustic tokens represents the phonetic granularity. This gives a 
two-dimensional representation of the acoustic token configurations 
in terms of temporal and phonetic granularities as in Fig. 2. Any point 
in this two-dimensional space in Fig. 2 corresponds to an acoustic token 
configuration. Acoustic tokens in different layers have different 
model granularities and extract complementary characteristics of the 
corpus and the language behind it, so they jointly capture knowledge 
about the corpus. Although the selection of the hyperparameters can 
be arbitrary in the above two-dimensional space, here we select 
$M$ temporal granularities ($m = m_1, m_2, \ldots, m_M$) and $N$ phonetic 
granularities ($n = n_1, n_2, \ldots, n_N$), forming a two-dimensional array 
of $M \times N$ hyperparameter pairs in the granularity space. 
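
As a small illustration of the $M \times N$ array of hyperparameter pairs, the following sketch enumerates the layers. The concrete values are the ones reported in the experimental setup of Section 3; the function and variable names are hypothetical.

```python
from itertools import product

# Granularity values as reported in the experimental setup (Section 3).
temporal_granularities = [3, 5, 7, 9]          # m: states per token HMM
phonetic_granularities = [50, 100, 300, 500]   # n: number of distinct tokens


def granularity_grid(ms, ns):
    """Return every (m, n) hyperparameter pair, one per MAT layer."""
    return list(product(ms, ns))


layers = granularity_grid(temporal_granularities, phonetic_granularities)
print(len(layers))            # number of layers, M * N
print(layers[0], layers[-1])  # first and last hyperparameter pair
```

Each pair in `layers` would correspond to one independently trained set of token HMMs.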


Figure 2: Model granularity space for HMM configurations. Axis label: phonetic granularity ($n$), the number of acoustic pattern HMMs. 


2.3. Mutual Reinforcement of Multi-layered Tokens 

Because all the layers obtained in the MAT above are learned in 
an unsupervised fashion, they are not precise. But we have many 
layers, each corresponding to a different pair of hyperparameters 
$\psi = (m, n)$, so they can be mutually reinforced. This is explained 
here and shown in Fig. 3, including token boundary fusion and LDA-based 
token label re-initialization as in Fig. 3(a). 


2.3.1. Token Boundary Fusion 

Fig. 3(b) shows the token boundaries when a part of an utterance is 
segmented into acoustic tokens on different layers with different 
hyperparameter pairs $\psi = (m, n)$. We define a boundary function 
$b_{m,n}(j)$ on each layer with $\psi = (m, n)$ for the possible boundary 
between every pair of adjacent frames within the utterance, where $j$ is 
the time index of such possible boundaries. On each layer, $b_{m,n}(j) = 1$ 
if boundary $j$ is a token boundary and 0 otherwise. The boundary 
functions $b_{m,n}(j)$ for all layers are then weighted and averaged 
to give a joint boundary function $B(j)$. The weights account for 
the fact that smaller $m$, or shorter HMMs, generate more boundaries. 
The peaks of $B(j)$ are then selected based on the second derivatives 
and some filtering and thresholding. This gives the new segmentation 
of the utterance as shown at the bottom of Fig. 3(b). 
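
The boundary fusion step can be sketched as below. The text does not give the exact weights or the peak-picking rule, so two assumptions are made here and should be read as placeholders: each layer is weighted in proportion to its number of states $m$ (so that short HMMs, which produce many boundaries, count less), and a peak is a position whose joint score reaches a threshold, has a negative discrete second difference, and is a local maximum.

```python
def fuse_boundaries(layer_boundaries, layer_states, threshold=0.5):
    """Fuse per-layer boundary indicators b_{m,n}(j) into a joint B(j).

    layer_boundaries: one 0/1 list per layer, indexed by boundary position j.
    layer_states: the value of m for each layer (weights proportional to m,
        which is an assumption of this sketch).
    Returns (joint, peaks): the joint function B(j) and the selected peaks.
    """
    total = float(sum(layer_states))
    weights = [m / total for m in layer_states]
    num_positions = len(layer_boundaries[0])
    joint = [
        sum(w * b[j] for w, b in zip(weights, layer_boundaries))
        for j in range(num_positions)
    ]
    peaks = []
    for j in range(num_positions):
        left = joint[j - 1] if j > 0 else 0.0
        right = joint[j + 1] if j < num_positions - 1 else 0.0
        second_difference = left - 2.0 * joint[j] + right
        if joint[j] >= threshold and second_difference < 0.0:
            if joint[j] >= left and joint[j] > right:
                peaks.append(j)
    return joint, peaks


boundaries = [
    [0, 1, 0, 1, 0, 1, 0, 0],  # layer with m = 3
    [0, 0, 0, 1, 0, 0, 1, 0],  # layer with m = 5
    [0, 0, 0, 1, 0, 0, 0, 0],  # layer with m = 7
]
joint, peaks = fuse_boundaries(boundaries, layer_states=[3, 5, 7])
print(peaks)  # [3]: the only position on which all three layers agree
```

Only position 3 is kept in this toy example, because the isolated boundaries proposed by single layers stay below the threshold.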

2.3.2. LDA-based Token Label Re-initialization 

As shown in Fig. 3(c), each new segment obtained above usually consists 
of a sequence of acoustic tokens on each layer based on the tokens 
defined on that layer. We now consider all the tokens on all the 
different layers as different words, so we have a vocabulary of 
$\sum_{i=1}^{MN} n_i$ words, i.e., there are $n_i$ words on the $i$-th layer 
and there are a total of $MN$ layers. A new segment here is thus 
considered as a document (bag-of-words) composed of words (tokens) 
collected from all different layers. Latent Dirichlet Allocation [16] (LDA) 
is performed for topic modeling, and then each document (new segment) is 
labeled with the most probable topic. Because in LDA a topic is 
characterized by a word distribution, here a token distribution across 
different layers may also represent a certain acoustic characteristic or a 
certain acoustic token. By setting the number of topics in LDA to the 
number of distinct tokens $n$ ($n = n_1, n_2, \ldots, n_N$) as in subsection 
2.2.2, we have a new initial label set $\omega_0$ as in (1) of subsection 
2.2.1, in which each new segment obtained here is a new acoustic token 
whose ID is the topic ID obtained by LDA. This new initial label set 
$\omega_0$ is then used to re-train all the acoustic tokens on all layers of 
the MAT as in (1)(2)(3). 
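
The document construction that precedes LDA can be sketched as follows. The sketch covers only the bag-of-words step and the vocabulary size; the LDA fit itself is left out. The data layout (per-frame token IDs for every layer) and all names are hypothetical, and counting each run of identical frame labels as one token occurrence is a simplifying assumption.

```python
from collections import Counter


def vocabulary_size(ms, ns):
    """Total number of 'words': the sum of n_i over all M * N layers."""
    return sum(n for _ in ms for n in ns)


def segment_document(segment, layer_labels):
    """Build the bag-of-words document for one fused segment.

    segment: (start_frame, end_frame), end exclusive.
    layer_labels: dict mapping a layer (m, n) to its per-frame token IDs.
    Returns a Counter whose keys are words of the form ((m, n), token_id).
    """
    start, end = segment
    document = Counter()
    for layer, frame_labels in layer_labels.items():
        previous = None
        for token_id in frame_labels[start:end]:
            if token_id != previous:  # count each run of a token once
                document[(layer, token_id)] += 1
            previous = token_id
    return document


layer_labels = {
    (3, 50): [4, 4, 7, 7, 7, 2],
    (5, 50): [1, 1, 1, 1, 9, 9],
}
doc = segment_document((0, 5), layer_labels)
print(sorted(doc.items()))
print(vocabulary_size([3, 5, 7, 9], [50, 100, 300, 500]))  # 3800
```

With the 16 layers of Section 3 the vocabulary has 4 x (50 + 100 + 300 + 500) = 3800 words, and each fused segment becomes one sparse count vector over that vocabulary.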



Figure 3: Mutual reinforcement of multi-layered tokens: (a) 
block diagram, (b) token boundary fusion, and (c) a new segment 
considered as a document (bag-of-words) and a token as a 
word in LDA-based token label re-initialization. 

      Method             English             Tsonga
                         across    within    across    within
(1)   Baseline           28.10     15.60     33.80     19.10
(2)   MFCC               28.63     15.89     30.77     16.34
(3)   DBM posterior      25.96     15.74     29.15     16.18
(4)   BNF-1st, MR-0      26.84     15.95     26.48     15.52
(5)   BNF-1st, MR-1      23.88     14.60     21.97     13.40
(6)   BNF-1st, MR-2      24.46     14.92     22.14     13.31
(7)   BNF-2nd, MR-0      26.55     16.27     26.23     15.05
(8)   BNF-2nd, MR-1      24.53     15.13     23.30     13.88
(9)   BNF-1st, MR-1*     21.92     13.95     21.42     12.84
(10)  BNF-2nd, MR-1*     24.13     15.24     23.05     14.03
(11)  Topline            16.00     12.10      4.50      3.50


Table 1: Results for Track 1 of the challenge, the best figure for 
each metric is shown in bold. 


2.4. The Multi-target DNN (MDNN) 

As shown in the right part of Fig. 1, the token label sequence from a 
layer (with a pair of hyperparameters $\psi = (m, n)$) is a valid target for 
supervised framewise training, although obtained in an unsupervised 
way. In this initial work, we do not use the HMM states as the 
target, but simply take the token label as the training target. As shown 
in Fig. 1, there are multi-layered token labels with different 
hyperparameter pairs $\psi = (m, n)$ for each utterance, so we jointly 
consider all the multi-layered token labels by learning the parameters of 
a single DNN with a uniformly weighted cross-entropy objective at the 
output layer. As a result, the bottleneck features (BNFs) extracted from 
this DNN automatically fuse all knowledge about the corpus and the 
language behind it learned from the different sets of acoustic tokens. 
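
A uniformly weighted multi-target cross-entropy can be sketched for a single frame as below. The sketch assumes that the output layer is split into one softmax group per MAT layer, which is a common multi-task arrangement but is not spelled out in the text, and that "uniformly weighted" means a plain average over layers. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one output group."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_target_cross_entropy(logits_per_layer, target_per_layer):
    """Average of the per-layer cross-entropies for one frame.

    logits_per_layer: one list of logits per MAT layer (length n for that layer).
    target_per_layer: the token label of the frame on each layer.
    """
    loss = 0.0
    for logits, target in zip(logits_per_layer, target_per_layer):
        probabilities = softmax(logits)
        loss += -math.log(probabilities[target])
    return loss / len(logits_per_layer)


# Toy frame with two layers of 2 and 4 tokens and uninformative logits.
loss = multi_target_cross_entropy(
    logits_per_layer=[[0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    target_per_layer=[1, 2],
)
print(loss)  # average of ln 2 and ln 4
```

Because every layer contributes with the same weight, no single granularity dominates the gradient that shapes the shared bottleneck layer.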

2.5. The Iterative Learning Framework for MAT-DNN 

Once the BNFs are extracted from the MDNN in iteration 1, they 
can be taken as the input of the MAT on the left of Fig. 1, replacing 
the initial acoustic features. The MAT then generates updated 
sets of multi-layered token labels, which can be used as the updated 
training targets of the MDNN. The input features of the MDNN can also 
be updated by concatenating the initial acoustic features with the newly 
extracted BNFs as the tandem features. This process can be repeated for 
several iterations until satisfactory results are obtained. The tandem 
feature used as the input of the MDNN can be further augmented by 
concatenating unsupervised features obtained in other systems, such 
as Deep Boltzmann Machine [17] (DBM) posteriorgrams, Long 
Short-Term Memory Recurrent Neural Network [18] (LSTM-RNN) 
autoencoder bottleneck features, and i-vectors [19] trained on MFCC. 
Although different from the conventional recurrent neural network 
(RNN), in which the recurrent structure is included in back-propagation 
training, the concatenation of the bottleneck features with other 
features in the next iteration of the MDNN is a kind of recurrent structure. 
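
The control flow of the feedback loop can be summarized in a schematic sketch. The three callables stand in for the MAT, MDNN training and bottleneck extraction; they are hypothetical placeholders, not the authors' tools, and the toy stand-ins below exist only to make the feature routing visible.

```python
def mat_dnn(initial_features, run_mat, train_mdnn, extract_bnf, num_iterations=2):
    """Schematic MAT-DNN loop over per-frame feature vectors.

    initial_features: list of frames, each a list of floats.
    run_mat, train_mdnn, extract_bnf: hypothetical stand-ins for the tokenizer,
        the multi-target DNN training and the bottleneck feature extraction.
    Returns the bottleneck features produced in each iteration.
    """
    mat_input = initial_features
    mdnn_input = initial_features
    history = []
    for _ in range(num_iterations):
        labels = run_mat(mat_input)             # multi-layered token labels
        model = train_mdnn(mdnn_input, labels)  # labels are the DNN targets
        bnf = extract_bnf(model, mdnn_input)
        history.append(bnf)
        mat_input = bnf                         # BNFs replace the MAT input
        # Tandem features: initial features concatenated with the new BNFs.
        mdnn_input = [a + b for a, b in zip(initial_features, bnf)]
    return history


# Toy stand-ins that only record the feature dimensions they receive.
initial = [[1.0, 2.0], [3.0, 4.0]]
mat_dims, mdnn_dims = [], []


def toy_mat(features):
    mat_dims.append(len(features[0]))
    return [0 for _ in features]


def toy_train(features, labels):
    mdnn_dims.append(len(features[0]))
    return "model"


def toy_extract(model, features):
    return [[sum(frame)] for frame in features]


history = mat_dnn(initial, toy_mat, toy_train, toy_extract, num_iterations=2)
print(mat_dims, mdnn_dims)  # [2, 1] [2, 3]
```

In the second pass the MAT sees only the one-dimensional toy BNFs, while the MDNN sees the two initial dimensions plus the BNF dimension.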


3. Experimental Setup 

The general framework of the MAT-DNN presented above allows many 
flexible configurations. In this work we train the MAT-DNN 
in the following manner. We set $m = 3, 5, 7, 9$ states per token 
HMM and $n = 50, 100, 300, 500$ distinct tokens in the MAT, which 
gives a total of 16 layers. 

In the first iteration, we use the 39-dimensional Mel-frequency 
Cepstral Coefficients (MFCC) with energy, delta and double delta 
as the initial acoustic features for the input to both the MAT and 
the MDNN. For the input of the MDNN, we concatenate the MFCC with a 
window of 4 frames before and after (39 x 9 = 351 dimensions) and an 
i-vector (400 dimensions) trained on the MFCC of each evaluation 
interval. The topology of the DNN is set to be 751(input)- 
256(hidden)-256(hidden)-39(bottleneck)-(target) with 3 hidden layers. 
Even without the feedback and tandem features, the MAT-DNN 
is a powerful self-contained unsupervised feature extractor. We 
compared the BNF extracted in the first iteration with the Deep Boltzmann 
Machine posteriorgrams mentioned in Section 2.5 that use the same 
MFCC as input. To make the comparison fair, we keep the 
dimensionality of these features at 39. For the Deep Boltzmann Machine, 
we used the 39-dimensional MFCC with a window of 5 frames 
before and after as the input. The configuration we used for the DBM 
is 429(visible)-256(hidden)-256(hidden)-39(hidden). We originally 
extracted another set of LSTM-RNN autoencoder bottleneck features 
as another baseline, but the performance was slightly worse than the 
MFCC, so we omit it from the discussion here. 
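
The input dimensions quoted above follow from simple context-window splicing, as the sketch below illustrates. How the edges of an utterance are padded is not stated in the text; repeating the first and last frame is an assumption of this sketch, and the names are hypothetical.

```python
def splice(frames, context):
    """Stack each frame with `context` frames before and after it.

    Edge frames are padded by repeating the first or last frame (assumption).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            index = min(max(t + offset, 0), last)
            window.extend(frames[index])
        spliced.append(window)
    return spliced


MFCC_DIM = 39
IVECTOR_DIM = 400

# Dummy utterance of 20 frames; only the dimensions matter here.
frames = [[float(t)] * MFCC_DIM for t in range(20)]

mdnn_input_dim = len(splice(frames, 4)[0]) + IVECTOR_DIM  # 39 * 9 + 400
dbm_input_dim = len(splice(frames, 5)[0])                 # 39 * 11
print(mdnn_input_dim, dbm_input_dim)  # 751 429
```

The two results match the 751-dimensional MDNN input layer and the 429-dimensional DBM visible layer.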

In the second iteration, we concatenate the original MFCC, the BNF 
extracted from the first iteration, the DBM posteriorgrams, and the 
i-vector, forming a (39x9 + 39x9 + 39x9 + 400 = 1453)-dimensional input to 
the MDNN. We used the updated transcriptions as the target and 
extracted the BNF as the features. The MAT is trained using 
zrst [20], a Python wrapper for the HTK toolkit [21] and SRILM [22] that 
we developed for training unsupervised HMMs with varying model 
granularity. The LDA in the mutual reinforcement was performed 
with MALLET [23]. The MFCC were extracted using the HTK 
toolkit [21]. The i-vectors were extracted using Kaldi [24]. The DBM 
posteriorgrams were extracted using libdnn [25]. The MDNN was trained 
using Caffe [26]. 

3.1. Track 1 

The two official corpora are the Buckeye corpus [27] and the NCHLT 
Xitsonga Speech corpus [28], in English and Tsonga respectively. They 
are used in the evaluation based on the ABX discriminability test [4], 
including across-speaker and within-speaker tests. The final results are 
error percentages, so lower is better. Our results for Track 1 
are presented in Table 1. 
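
To make the metric concrete, the following is a minimal sketch of an ABX error computed with DTW distances. It is not the official evaluation tool: the frame-level cosine distance, the normalization of the DTW cost by the summed sequence lengths, and counting ties as half an error are all assumptions of this sketch.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature frames."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dtw_distance(x, y, frame_distance=cosine_distance):
    """Accumulated DTW cost between two frame sequences, length-normalized."""
    inf = float("inf")
    cost = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(x)][len(y)] / (len(x) + len(y))


def abx_error(triples):
    """Fraction of (A, B, X) triples in which X, which belongs to the same
    category as A, is nevertheless closer to B."""
    errors = 0.0
    for a, b, x in triples:
        d_ax = dtw_distance(a, x)
        d_bx = dtw_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triples)


a = [[1.0, 0.0], [1.0, 0.0]]
b = [[0.0, 1.0], [0.0, 1.0]]
x = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]]
print(abx_error([(a, b, x)]))             # 0.0: X is correctly closer to A
print(abx_error([(a, b, x), (b, a, x)]))  # 0.5: the second triple is an error
```

A feature representation scores well when tokens of the same phone category stay close under DTW regardless of speaker, which is what the within-speaker and across-speaker columns of Table 1 measure.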

Rows (1) and (11) are the official baseline MFCC features and the 
official topline supervised phone posteriorgrams provided by the 
challenge organizers respectively. Row (2) is our baseline of the MFCC 
features, the initial acoustic features used to train all systems in this 
work. Row (3) is for the DBM posteriorgrams extracted from the 
MFCC of row (2), serving as a strong unsupervised baseline. The 
results in rows (4), (5) and (6) are the performance of the bottleneck 
features extracted in the first iteration of the MAT-DNN without 
applying mutual reinforcement (MR) (4), applying MR once (5), and 
twice (6) respectively. Row (9) is similar to row (5), except we use 
a wider bottleneck layer with 256 dimensions instead of 39. Rows 
(7) and (8) are the performance of the bottleneck features extracted in 
the second iteration of the MAT-DNN without applying MR (7) and 
applying MR once (8). The MAT of the MAT-DNN in (7) and (8) 
is trained using the BNF of row (5). Row (10) is similar to row (8), 
except only the MFCC and i-vectors are concatenated as input without 
other features. 

All the features from row (2) to (10) except for (9) are confined 
to 39 dimensions. This allows fast and fair comparison of different 
algorithms. We observe that as a stand-alone feature extractor without 
any iterations, the MAT-DNN in row (5) outperforms the DBM 
baseline in (3). The effect of mutual reinforcement can be seen in the 
improvement from row (4) to rows (5) and (6), and from row (7) to 
row (8). We observe that a single iteration of mutual reinforcement of 
the targets of the MAT-DNN is enough to bring a large improvement to 
the system (for example, the English across-speaker error drops from 
26.84 in row (4) to 23.88 in row (5)). The effect of iterations in the 
MAT-DNN can be seen by comparing rows (2), (5) and (8), corresponding 
to 0, 1 and 2 iterations respectively. Although the performance 
improvement from row (2) to row (5) is notable, performance dropped in 
the second iteration in (8). To investigate the reasons for the 
performance drop, we widened the bottleneck feature to 256 dimensions 
in (9) and observed a clear improvement over row (5) on all four metrics. 
It is possible that we have not explored the full potential 
of the MAT-DNN, as comparison between algorithms was the original 
goal when we designed the experiments. For a better tuned set 
of parameters, improvement in following iterations is to be expected 
on Track 1. Nonetheless, the benefit of the second iteration is better 
observed in Track 2. 

3.2. Track 2 

The evaluation tool for Track 2 provided by the challenge 
organizers [3] gives five main metrics plus two more scores: NED and 
coverage. Fig. 4 shows the results for (a) English and (b) Tsonga in 
NED, as well as the F-measures for the five main metrics: matching, 
grouping, type, token, and boundary, each in a subgraph. We omit 
coverage here because it is almost 100% in all cases, so there are six 
subfigures in Fig. 4(a) and (b). In each subfigure, the results for four 
cases are shown, corresponding to the four MAT targets used for 
the MDNN bottleneck features listed in rows (4), (5), (6) and (8) of 
Table 1. For each of these token sets, the three or six groups of bars 
correspond to different values of $m$ ($m = 3, 5, 7$ or $m = 3, 5, 7, 9, 11, 
13$), while in each group the four bars correspond to the values of $n$ 
($n = 50, 100, 300, 500$ from left to right), where $\psi = (m, n)$ are the 
parameters for the token sets. Those bars in blue are better than the 
JHU baseline, while those in white are worse. Only the results jointly 
considering both within and across talker conditions are shown. 

From Fig. 4(a) for English, it can be seen that the proposed token 
sets perform well in type, token and boundary scores, although much 
worse in matching and grouping. We see in many cases the benefits 
brought by MR (e.g. (6) vs (5) in type of Fig. 4(a)) and the second 
iteration (e.g. (8) vs (6) in boundary of Fig. 4(a)), especially for small 
values of $m$. In many groups for a given $m$, smaller values of $n$ 
seemed better, probably because $n = 50$ is close to the total number of 
phonemes in the language. Also, a general trend is that larger values 
of $m$ were better, probably because HMMs with more states were 
better at modelling the relatively long units; this may directly lead to 
the higher type, token and boundary scores. 

(%)                                          NED   Cov.  Matching          Grouping          Type              Token             Boundary
                                                         P     R     F     P     R     F     P     R     F     P     R     F     P     R     F
Eng.  JHU                                    21.9  16.3  39.4  1.6   3.1   21.4  84.6  33.3  6.2   1.9   2.9   5.5   0.4   0.8   44.1  4.7   8.6
      (A) (4) BNF-1st, MR-0, psi = (7, 50)   87.5  100   1.4   0.5   0.8   3.6   18.7  6     4.2   11.9  6.2   8.3   15.7  10.9  35.2  84.6  49.8
      JHU                                    12    16.2  ...   60.2  ...   13.1  ...   3.9   ...

Table 2: Comparison of three typical example token sets selected out of all shown in Fig. 4 with the JHU baseline. Those better than 
the JHU baseline are in bold. 

Figure 4: Results for Track 2 for (a) English and (b) Tsonga. Each subgraph is an evaluation measure for four cases of token sets used 
to train the bottleneck features listed in four rows of Table 1 as shown at the bottom. The four bars in each group for a value of $m$ are 
for $n = 50, 100, 300, 500$ from left to right (not shown in the figure) and $\psi = (m, n)$ are parameters for the token sets. Blue, yellow 
and white bars correspond to better, equal to or worse as compared to the JHU baseline at the upper left corner of each subgraph. The 
coverage is not shown because it is almost 100% in all cases. 

Similar observations can be made for Tsonga in Fig. 4(b), and 
the overall performance seemed to be even better, as the proposed token 
sets perform well even in matching scores. The improvements 
brought by MR, the bottleneck features and the second iteration are 
better observed here, which gives the best cases for all the five main 
scores. This is probably due to the fact that more sets of tokens were 
available for MR and MAT-DNN on Tsonga than on English. This 
observation suggests that more token sets introduce more 
robustness, and that leads to better token sets for the next iteration. 
When $m$ goes to 13, we see that without MR (in (4) of Fig. 4(b)) almost 
all metrics degrade except for matching scores, but with MR 
almost all the scores consistently increase (except for NED) when 
$m$ becomes larger. This suggests that MR can also prevent degradation 
while detecting relatively long units. 

We also selected three typical example token sets (A)(B)(C) out 
of the many proposed here and shown in Fig. 4, and compared them 
with the JHU baseline [29] in Table 2, including Precision (P), Recall 
(R) and F-scores (F). These three example sets are also marked in 
Fig. 4. In Table 2, those better than the JHU baseline are in bold. The 
much higher NED and coverage scores suggest that the proposed 
approach is a highly permissive matching algorithm. The much higher 
parsing scores (type, token and boundary scores), especially the Recall 
and F-scores, imply the proposed approach is more successful 
in discovering word-like units. However, the matching and grouping 
scores are much worse, probably because the discovered tokens 
cover almost the whole corpus, including short pauses or silence, and 
therefore many tokens are actually noise. Another possible reason 
might be that the values of $n$ used are much smaller than the size of 
the real word vocabulary, so that the same token label is used for signal 
segments of varying characteristics, which degrades the grouping 
quality. 

4. Conclusion 

This paper summarizes the preliminary work done for the Zero Resource 
Speech Challenge in Interspeech 2015. We propose a MAT-DNN 
to generate multi-layer token sets and fuse the various knowledge 
in different token sets in the bottleneck features. We present 
the complete results on all evaluations we tested up to the submission 
deadline, with the hope that these results serve as good references for 
future investigations. 